Silent error detection in numerical time-stepping schemes

Authors

  • Austin R. Benson
  • Sven Schmit
  • Robert Schreiber
Abstract

Errors due to hardware or low-level software problems, if detected, can be fixed by various schemes, such as recomputation from a checkpoint. Silent errors are errors in application state that have escaped low-level error detection. At extreme scale, where machines can perform astronomically many operations per second, silent errors threaten the validity of computed results. We propose a new paradigm for detecting silent errors at the application level. Our central idea is to frequently compare computed values to those provided by a cheap checking computation, and to build error detectors based on the difference between the two output sequences. Numerical analysis provides us with usable checking computations for the solution of initial-value problems in ODEs and PDEs, arguably the most common problems in computational science. Here, we provide, optimize, and test methods based on Runge-Kutta and linear multistep methods for ODEs, and on implicit and explicit finite difference schemes for PDEs. We take the heat equation and Navier-Stokes equations as examples. In tests with artificially injected errors, this approach effectively detects almost all meaningful errors, without significant slowdown.

1 Silent errors and checking schemes

1.1 Silent errors are worrisome

Computational scientists are concerned about silent errors in exascale computing. Silent errors are perturbations to application state that may lead to a failure such as a bad final solution [Snir et al., 2013]. These errors may arise from a bit flip, a firmware bug, data races, and other causes. Several authors ([Cappello et al., 2009, Dongarra et al., 2011, Snir et al., 2013]) have discussed the sources and the frequency of silent errors. Why the current concern? An exaflop machine will be able to do on the order of 10 operations per day, and will have on the order of 10 bytes of memory [Dongarra et al., 2011]. And in order to achieve very aggressive energy efficiency and performance targets, machine architects are pushing envelopes: with near threshold voltage logic, with new memory and storage technologies, and with photonic communication. Consumer quality hardware may already suffer errors at the personal computer scale once per year [Nightingale et al., 2011], and cost precludes really significant hardening of the hardware in supercomputers. Thus, the scale of systems makes such errors quite likely. Indeed, some high-performance systems today already suffer from silent errors at a troublesome rate [Shi et al., 2009].

1.2 Algorithmic responses to silent errors

The numerical algorithms community has already looked at error vulnerability. It is well known that many errors do not cause failures. Other errors lead to an obvious application failure. Silent ...

Affiliations: Institute for Computational and Mathematical Engineering, Stanford University, Stanford, California, USA. HP Labs, Palo Alto, California, USA. We thank the US Department of Energy, which supported this work under Award Number DE SC0005026.
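The central idea stated in the abstract, comparing each computed value with a cheap checking value and flagging unusually large differences, can be illustrated with a minimal sketch. This is not the authors' detector: the pairing of Heun's method with forward Euler, the test problem y' = -y, the threshold rule, and all names (`heun_step`, `integrate_with_check`, `factor`, `corrupt_step`) are assumptions made only for illustration.

```python
def f(t, y):
    # Illustrative test problem y' = -y (an assumption, not from the paper).
    return -y


def heun_step(t, y, h):
    """One step of Heun's method (a 2-stage Runge-Kutta scheme).

    Also returns the forward Euler value, which is available from the
    first stage at no extra cost and serves as the cheap checking value.
    """
    k1 = f(t, y)
    y_euler = y + h * k1                # cheap checking value
    k2 = f(t + h, y_euler)
    y_heun = y + 0.5 * h * (k1 + k2)    # value the solver keeps
    return y_heun, y_euler


def integrate_with_check(y0, h, n_steps, corrupt_step=None, factor=10.0):
    """Advance n_steps and return the indices of steps flagged as suspicious.

    A step is flagged when |y_heun - y_euler| exceeds `factor` times the
    largest discrepancy seen on earlier accepted steps (a hypothetical rule).
    The very first step has no history and is therefore never flagged.
    """
    t, y = 0.0, y0
    max_seen = None
    flagged = []
    for n in range(n_steps):
        y_new, y_check = heun_step(t, y, h)
        if n == corrupt_step:
            y_new += 1e-2               # artificially injected silent error
        diff = abs(y_new - y_check)
        if max_seen is not None and diff > factor * max_seen:
            flagged.append(n)
            # Simple recovery for the sketch: recompute the step.
            y_new, y_check = heun_step(t, y, h)
            diff = abs(y_new - y_check)
        max_seen = diff if max_seen is None else max(max_seen, diff)
        t, y = t + h, y_new
    return flagged


if __name__ == "__main__":
    print(integrate_with_check(1.0, 0.01, 100))
    print(integrate_with_check(1.0, 0.01, 100, corrupt_step=50))
```

For a smooth solution the two values differ by a term of order h squared, so a discrepancy far above recent history suggests corrupted state rather than ordinary truncation error. A practical detector would need a more careful threshold than this fixed multiple.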


Similar resources

A General Solution for Implicit Time Stepping Scheme in Rate-dependant Plasticity

In this paper the derivation of the second differentiation of a general yield surface implicit time stepping method along with its consistent elastic-plastic modulus is studied. Moreover, the explicit, trapezoidal implicit and fully implicit time stepping schemes are compared in rate-dependant plasticity. It is shown that implementing fully implicit time stepping scheme in rate-dependant plasti...

Strongly stable multi-time stepping method with the option of controlling amplitude decay in responses

Recently, multi-time stepping methods have become very popular among scientists due to their high stability in problems with critical conditions. One important shortcoming of these methods is their large amount of uncontrolled amplitude decay. This study proposes a new multi-time stepping method in which the time step is split into two sub-steps. The first sub-step is solved using the well-...

Adaptive Weak Approximation of Diffusions with Jumps

This work develops adaptive time stepping algorithms for the approximation of a functional of a diffusion with jumps based on a jump augmented Monte Carlo Euler–Maruyama method, which achieve a prescribed precision. The main result is the derivation of new expansions for the time discretization error, with computable leading order term in a posteriori form, which are based on stochastic flows a...

Estimating Global Errors in Time Stepping

This study introduces new strategies for global error estimation in time-stepping algorithms. The new methods propagate the defect along with the numerical solution much like the Zadunaisky procedure; however, the proposed approach allows for overlapped internal computations and, therefore, represents a generalization of the classical numerical schemes for solving differential equations with gl...

Goal-Oriented Error Estimation for the Fractional Step Theta Scheme

In this work, we derive a goal-oriented a posteriori error estimator for the error due to time-discretization of nonlinear parabolic partial differential equations by the fractional step theta method. This time-stepping scheme is assembled by three steps of the general theta method, that also unifies simple schemes like forward and backward Euler as well as the Crank–Nicolson method. Further, b...

Journal:
  • IJHPCA

Volume 29

Publication date: 2015